Interactive data visualization for age-sex pyramid

Creating data visualisation beyond default

Min Xiaoqi https://www.linkedin.com/in/xiaoqi-min/ (Master of IT in Business, Singapore Management University) https://scis.smu.edu.sg/master-it-business/financial-technology-and-analytics-track
2022-02-03

1. The task

In this exercise, I will apply appropriate interactivity and animation methods to design an age-sex pyramid visualisation that shows how the demographic structure of Singapore changed by age cohort and gender between 2000 and 2020 at the planning area level.

For this task, the data sets entitled Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010 and Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020 should be used. These data sets are available from the Department of Statistics home page.

2. Data challenges

3. Proposed sketch

4. Installing/loading required packages

The code chunk below checks whether each required R package is installed, installs any that are missing, and then loads it.

packages = c('ggiraph', 'plotly', 'DT', 'patchwork', 'gganimate', 'tidyverse', 'readxl', 'gifski', 'gapminder')
for (p in packages){
  # require() returns FALSE when the package cannot be loaded
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

5. Data import

In this task, the data sets Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2000-2010 and Singapore Residents by Planning Area / Subzone, Age Group, Sex and Type of Dwelling, June 2011-2020 will be used, both of which are csv files. The code chunk below imports them into the R environment using the read_csv() function of the readr package.

set1 <- read_csv('data/respopagesextod2000to2010.csv')
set2 <- read_csv('data/respopagesextod2011to2020.csv')

6. Data visualization

Two visualizations will be created in this section. The animated age-sex pyramid gives an overview of how the demographic structure of Singapore changed by age and gender from 2000 to 2020. The interactive age-sex pyramid gives a more in-depth view of the demographic structure by age, gender and time period at the planning area level.

Animated Age-Sex pyramid

Data wrangling

Combining data sets

rbind() is used to combine the two data sets into one data frame. This works because the second data set continues from the first and both have the same column variables.

popdata <- rbind(set1, set2)
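Since rbind() on data frames matches columns by name and stops with an error when the names differ, a quick check before combining can be useful. The sketch below uses two small made-up data frames (toy_a and toy_b are hypothetical stand-ins for set1 and set2, and all values are invented) to illustrate the idea.

```r
# Toy stand-ins for set1 and set2; the values are invented for illustration.
toy_a <- data.frame(PA = "Bedok", AG = "0_to_4", Sex = "Males",
                    Pop = 10, Time = 2010)
toy_b <- data.frame(PA = "Bedok", AG = "0_to_4", Sex = "Males",
                    Pop = 20, Time = 2011)

# Confirm both pieces share the same columns in the same order.
same_columns <- identical(names(toy_a), names(toy_b))

# Stack the rows only when the layouts agree.
stopifnot(same_columns)
toy_all <- rbind(toy_a, toy_b)

nrow(toy_all)        # one row from each piece
range(toy_all$Time)  # the combined data spans both periods
```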

Compute population count

In this step, we create a new data frame “popdata_grouped” using group_by() of the dplyr package to group the population by time, age group and sex. Then summarise() of dplyr is used to create a column “population” that holds the total population of each group, summed across all planning areas.

popdata_grouped <- popdata %>%
  group_by(Time, AG, Sex) %>%
  summarise(population = sum(Pop)) %>%
  ungroup()
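To make the effect of the grouping concrete, here is a minimal sketch on invented rows (the toy tibble and its numbers are assumptions, not values from the real files). Because PA is left out of group_by(), the two planning areas are added together for each sex.

```r
library(dplyr)

# Invented rows: two planning areas, one age group, one year.
toy <- tibble(
  PA   = c("Bedok", "Bedok", "Punggol", "Punggol"),
  AG   = "0_to_4",
  Sex  = c("Males", "Females", "Males", "Females"),
  Time = 2020,
  Pop  = c(100, 90, 60, 50)
)

# Grouping without PA sums the planning areas together.
toy_grouped <- toy %>%
  group_by(Time, AG, Sex) %>%
  summarise(population = sum(Pop), .groups = "drop")

toy_grouped
```

The `.groups = "drop"` argument returns an ungrouped result directly, which has the same effect as calling ungroup() afterwards.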

Sorting data

Since we would like the population pyramid to run in descending order of age, with the oldest age group at the top, we will use arrange() of the dplyr package to sort the data in descending order of age group, as shown in the code chunk below.

popdata_sorted <- popdata_grouped %>%
  arrange(desc(AG))

After sorting, we noticed that the rows for age group “5_to_9” do not come immediately before the rows for age group “0_to_4”. This is because the “AG” field is stored as text and is therefore sorted character by character rather than numerically. Hence, we need to standardize the format of the “AG” field by changing “5_to_9” and “0_to_4” to “05_to_09” and “00_to_04”, as shown in the following code chunk:

popdata_sorted$AG[popdata_sorted$AG == "0_to_4"] <- "00_to_04"
popdata_sorted$AG[popdata_sorted$AG == "5_to_9"] <- "05_to_09"
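The following sketch illustrates why zero-padding helps, using a short made-up vector of age-group labels (ag is a hypothetical name). It sorts with method = "radix" so that the result does not depend on the locale of the machine.

```r
ag <- c("0_to_4", "5_to_9", "10_to_14", "50_to_54")

# method = "radix" compares strings byte by byte, independent of locale.
# Here "5_to_9" ends up out of its numeric position.
sort(ag, method = "radix")

# Zero-pad the two single-digit groups, as in the chunk above.
ag[ag == "0_to_4"] <- "00_to_04"
ag[ag == "5_to_9"] <- "05_to_09"

# With equal-width numbers, text order and numeric order agree.
sorted_ag <- sort(ag, method = "radix")
sorted_ag
```

An alternative would be to turn AG into a factor with explicitly ordered levels; the padding approach is kept here because it matches the text.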

As we can see from the “population” column, most of the population sums have five or six digits, which would clutter the axis labels when we plot the chart. Hence, mutate() of the dplyr package is used to create a new column “popinthousands” that expresses the population in thousands, as shown below.

popdata_final <- popdata_sorted %>%
  mutate(popinthousands = population/1000)

Data visualization

In order to have both male and female data in the same plot, we use the ifelse() function to set the sign of the x value. If “Sex” is “Males”, the population count is converted to a negative value, since we would like the male portion on the left of the plot. If “Sex” is “Females”, the value is left positive and is shown on the right.

geom_col() is used to plot the chart, since a population pyramid consists of bars representing the different age groups.

scale_x_continuous() is used to rescale the x axis, mainly to place the axis tick marks over the range that we desire by stating the minimum, the maximum and the step size. The labels are set to the absolute values of the breaks so that the male side does not show negative numbers.

Other functions such as theme() and scale_fill_manual() are used to customize the display and appearance of the chart.

transition_time() is used to make the plot step through the years, with each frame showing the pyramid for a specific time period.

ggplot(data = popdata_final,
       aes(x = ifelse(test = Sex == 'Males',
                      yes = -popinthousands,
                      no = popinthousands),
           y = AG, fill = Sex)) +
  geom_col() +
  scale_x_continuous(breaks = seq(-160, 160, 20),
                     labels = abs(seq(-160, 160, 20))) +
  labs(title = 'Age-Sex Pyramid in Year: {frame_time}',
       x = "Population Count (in thousands)",
       y = "Age Group") +
  theme_bw() +
  theme(plot.title = element_text(face = 'bold', size = 14)) +
  scale_fill_manual(values = c('Males' = 'steelblue2', 'Females' = 'plum2')) +
  transition_time(as.integer(Time)) +
  ease_aes('linear')
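Before animating, it can help to check the mirrored layout on a single static frame. The sketch below does this on a small invented data frame (toy, signed, axis_breaks and static_pyramid are hypothetical names, and the numbers are made up); it leaves out transition_time() so that only ggplot2 is needed.

```r
library(ggplot2)

# Invented single-year data in thousands; not taken from the real files.
toy <- data.frame(
  AG  = rep(c("00_to_04", "05_to_09", "10_to_14"), times = 2),
  Sex = rep(c("Males", "Females"), each = 3),
  popinthousands = c(110, 120, 125, 105, 115, 118)
)

# Males get negative values so their bars extend to the left of zero.
toy$signed <- ifelse(toy$Sex == "Males",
                     -toy$popinthousands,
                     toy$popinthousands)

# Breaks sit at signed positions but are labelled with absolute values.
axis_breaks <- seq(-160, 160, 20)

static_pyramid <- ggplot(toy, aes(x = signed, y = AG, fill = Sex)) +
  geom_col() +
  scale_x_continuous(breaks = axis_breaks, labels = abs(axis_breaks)) +
  labs(x = "Population Count (in thousands)", y = "Age Group") +
  theme_bw()

static_pyramid
```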

Interactive Age-Sex pyramid

Data wrangling

Compute population count

In this step, we create a new data frame “popdata_grouped2” using group_by() of the dplyr package to group the population by planning area, time, age group and sex. Then summarise() of dplyr is used to create a column “population” that holds the total population of each group.

popdata_grouped2 <- popdata %>%
  group_by(PA, Time, AG, Sex) %>%
  summarise(population = sum(Pop)) %>%
  ungroup()

Sorting data

Since we would like the population pyramid to run in descending order of age, with the oldest age group at the top, we will use arrange() of the dplyr package to sort the data in descending order of age group, as shown in the code chunk below.

popdata_sorted2 <- popdata_grouped2 %>%
  arrange(desc(AG))

After sorting, we noticed that the rows for age group “5_to_9” do not come immediately before the rows for age group “0_to_4”. This is because the “AG” field is stored as text and is therefore sorted character by character rather than numerically. Hence, we need to standardize the format of the “AG” field by changing “5_to_9” and “0_to_4” to “05_to_09” and “00_to_04”, as shown in the following code chunk:

popdata_sorted2$AG[popdata_sorted2$AG == "0_to_4"] <- "00_to_04"
popdata_sorted2$AG[popdata_sorted2$AG == "5_to_9"] <- "05_to_09"

As we can see from the “population” column, the population sums have many digits, which would clutter the axis labels when we plot the chart. Hence, mutate() of the dplyr package is used to create a new column “popinthousands” that expresses the population in thousands, as shown below.

popdata_final2 <- popdata_sorted2 %>%
  mutate(popinthousands = population/1000)

Creating data frames for male and female individually

In order to have the male and female populations displayed side by side, we first create two separate data frames, one for males and one for females. This is done with the filter() function of the dplyr package.

male <- popdata_final2 %>%
  filter(Sex == "Males")

female <- popdata_final2 %>%
  filter(Sex == "Females")

Data visualization

For this data visualization, we use plot_ly() to create two charts, one for the male and one for the female population. The x axis is the population in thousands computed earlier and the y axis is the age group. Planning area is mapped to the colour of the bars so that areas can be filtered later on, and the year is set as the animation frame. A tooltip showing the year, planning area, age group, sex and population is added for a clearer view when the user hovers over a specific bar. In the plot for the male population, the x axis is given autorange = 'reversed' so that its bars extend to the left, mirroring the female plot. Lastly, subplot() is used to put the two plots side by side with shared x and y axes.

# Male panel: horizontal bars on a reversed x axis so they extend to the left
p1 <- plot_ly(data = male,
              x = ~popinthousands,
              y = ~AG,
              type = 'bar',
              orientation = 'h',
              color = ~PA,
              colors = "Set1",
              frame = ~Time,
              text = ~paste("Year:", Time,
                            "<br>Planning Area:", PA,
                            "<br>Age Group:", AG,
                            "<br>Sex:", Sex,
                            "<br>Population(thousands):", popinthousands)) %>%
  layout(xaxis = list(title = list(text = 'Male', standoff = 25), autorange = 'reversed'),
         yaxis = list(title = list(text = 'Age Group', standoff = 25)))

# Female panel: same mapping on a normal x axis
p2 <- plot_ly(data = female,
              x = ~popinthousands,
              y = ~AG,
              type = 'bar',
              orientation = 'h',
              color = ~PA,
              colors = "Set1",
              frame = ~Time,
              text = ~paste("Year:", Time,
                            "<br>Planning Area:", PA,
                            "<br>Age Group:", AG,
                            "<br>Sex:", Sex,
                            "<br>Population(thousands):", popinthousands)) %>%
  layout(xaxis = list(title = list(text = 'Female', standoff = 25)),
         yaxis = list(title = list(text = 'Age Group', standoff = 25)))

# Place the two panels side by side
subplot(p1, p2, shareX = TRUE, shareY = TRUE)
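The mirrored layout can be tried out on a much smaller example first. The sketch below is a minimal version without frames or colour grouping, using invented values for a single planning area and year (toy_m, toy_f, left, right and fig are hypothetical names). It only illustrates how autorange = "reversed" and subplot() combine.

```r
library(plotly)

# Invented single-year values in thousands for one planning area.
toy_m <- data.frame(AG = c("00_to_04", "05_to_09"),
                    popinthousands = c(4.1, 4.6))
toy_f <- data.frame(AG = c("00_to_04", "05_to_09"),
                    popinthousands = c(3.9, 4.4))

# Left panel: the reversed x axis makes the bars extend to the left.
left <- plot_ly(toy_m, x = ~popinthousands, y = ~AG,
                type = "bar", orientation = "h", name = "Males") %>%
  layout(xaxis = list(title = "Male", autorange = "reversed"))

# Right panel: ordinary x axis.
right <- plot_ly(toy_f, x = ~popinthousands, y = ~AG,
                 type = "bar", orientation = "h", name = "Females") %>%
  layout(xaxis = list(title = "Female"))

# Share the age-group axis and keep each panel's x-axis title.
fig <- subplot(left, right, shareY = TRUE, titleX = TRUE)
fig
```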

7. Conclusion

Looking at the animated age-sex pyramid, we can see that the young population (aged under 19) shrank significantly over the years, while the senior population (aged between 50 and 80) grew significantly, which suggests that Singapore has an ageing population. From the interactive pyramid, planning areas such as Punggol saw a boost in population, especially in the young to middle age bands.

Interactive data visualization gives users a customized way of exploring the data and the charts, which also improves their understanding of the message behind the data. It allows users to navigate freely through the visualization and gain in-depth information from the plots.